Systems Review, ACM SIGPLAN Notices, Proceedings of the fourth
operating systems ASPLOS-V, Volume 27 issue 9
Publisher: ACM Press 

Full text available: pdf(1.52 MB) Additional Information: full citation, references, citings, index terms




4 Cache coherence in large-scale shared-memory multiprocessors: issues and comparisons
David J. Lilja

September 1993 ACM Computing Surveys (CSUR), Volume 25 issue 3 
Publisher: ACM Press 



http://portal.acm.org/resultsxfm?CFID=69083944&CFTOKEN=2338313 2/17/06 



Results (page 1): +multiprocessor, +allocation, +sequential +memory, +bypass, +operatin... Page 2 of 6
Full text available: pdf(3.12 MB) Additional Information: full citation, references, citings, index terms



5 Architectural primitives for a scalable shared memory multiprocessor 
Supporting dynamic data structures on distributed-memory machines
Anne Rogers, Martin C. Carlisle, John H. Reppy, Laurie J. Hendren 

March 1995 ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 17 Issue 2
Publisher: ACM Press 

Full text available: pdf(2.05 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Compiling for distributed-memory machines has been a very active research area in recent 
years. Much of this work has concentrated on programs that use arrays as their primary 
data structures. To date, little work has been done to address the problem of supporting 
programs that use pointer-based dynamic data structures. The techniques developed for 
supporting SPMD execution of array-based programs rely on the fact that arrays are 
statically defined and directly addressable. Recursive data s ... 
Hiding memory latency using dynamic scheduling in shared-memory multiprocessors
Full text available: pdf(1.65 MB) Additional Information: full citation, abstract, references, citings, index terms

The large latency of memory accesses is a major impediment to achieving high 
performance in large-scale shared-memory multiprocessors. Relaxing the memory
consistency model is an attractive technique for hiding this latency by allowing the overlap 
of memory accesses with other computation and memory accesses. Previous studies on 
relaxed models have shown that the latency of write accesses can be hidden by buffering 
writes and allowing reads to bypass pending writes. Hiding the latency of ... 
terms, review

The Hurricane File System (HFS) is designed for (potentially large-scale) shared-memory 
multiprocessors. Its architecture is based on the principle that, in order to maximize 
performance for applications with diverse requirements, a file system must support a wide 
variety of file structures, file system policies, and I/O interfaces. Files in HFS are 
implemented using simple building blocks composed in potentially complex ways. This 
approach yields great flexibility, allowing an application ... 

Keywords: customization, data partitioning, data replication, flexibility, parallel 
computing, parallel file system 
Publisher: ACM Press

Full text available: pdf(2.44 MB) Additional Information: full citation, abstract, references, citings, index
Current microprocessors aggressively exploit instruction-level parallelism (ILP) through
techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent 
work has shown that memory latency remains a significant performance bottleneck for 
shared-memory multiprocessor systems built of such processors. This paper provides the
first study of the effectiveness of software-controlled non-binding prefetching in shared 
memory multiprocessors built of state-of-the-art ILP-based proces ... 

11 External memory algorithms and data structures: dealing with massive data 

Jeffrey Scott Vitter 

June 2001 ACM Computing Surveys (CSUR), Volume 33 issue 2 
Publisher: ACM Press 

Full text available: pdf(828.46 KB) Additional Information: full citation, abstract, references, citings, index
Data sets in large applications are often too massive to fit completely inside the computer's
internal memory. The resulting input/output communication (or I/O) between fast internal 
memory and slower external memory (such as disks) can be a major performance 
bottleneck. In this article we survey the state of the art in the design and analysis of 
external memory (or EM) algorithms and data structures, where the goal is to exploit 
locality in order to reduce the I/O costs. We consider a varie ... 

Keywords: B-tree, I/O, batched, block, disk, dynamic, extendible hashing, external 
memory, hierarchical memory, multidimensional access methods, multilevel memory, 
online, out-of-core, secondary storage, sorting 
Full text available: pdf(213.94 KB) Additional Information: full citation, abstract, references, citings, index

Dynamic memory allocators (malloc/free) rely on mutual exclusion locks for protecting the 
consistency of their shared data structures under multithreading. The use of locking has 
many disadvantages with respect to performance, availability, robustness, and 
programming flexibility. A lock-free memory allocator guarantees progress regardless of 
whether some threads are delayed or even killed and regardless of scheduling policies. 
This paper presents a completely lock-free memory allocator. It uses ... 

Keywords: async-signal-safe, availability, lock-free, malloc 



13 Trace-driven memory simulation: a survey 
Richard A. Uhlig, Trevor N. Mudge

June 1997 ACM Computing Surveys (CSUR), Volume 29 issue 2
As the gap between processor and memory speeds continues to widen, methods for
evaluating memory system designs before they are implemented in hardware are 
becoming increasingly important. One such method, trace-driven memory simulation, has 
been the subject of intense interest among researchers and has, as a result, enjoyed rapid 
development and substantial improvements during the past decade. This article surveys 
and analyzes these developments by establishing criteria for evaluating trac ... 

Keywords: TLBs, caches, memory management, memory simulation, trace-driven 
simulation 



14 GTW: a time warp system for shared memory multiprocessors 

Samir Das, Richard Fujimoto, Kiran Panesar, Don Allison, Maria Hybinette 
December 1994 Proceedings of the 26th conference on Winter simulation 
15 Instruction prefetching of systems codes with layout optimized for reduced cache misses

Chun Xia, Josep Torrellas 
annual international symposium on Computer architecture ISCA '96, Volume 24 Issue 2
Publisher: ACM Press 

Full text available: pdf(1.65 MB) Additional Information: full citation, abstract, references, citings, index
High-performing on-chip instruction caches are crucial to keep fast processors busy.
Unfortunately, while on-chip caches are usually successful at intercepting instruction 
fetches in loop-intensive engineering codes, they are less able to do so in large systems 
codes. To improve the performance of the latter codes, the compiler can be used to lay out 
the code in memory for reduced cache conflicts. Interestingly, such an operation leaves 
the code in a state that can be exploited by a new type of ... 







16 Data prefetch mechanisms 

Steven P. Vanderwiel, David J. Lilja 

June 2000 ACM Computing Surveys (CSUR), Volume 32 issue 2 
Publisher: ACM Press 

Full text available: pdf(172.07 KB) Additional Information: full citation, abstract, references, citings, index
The expanding gap between microprocessor and DRAM performance has necessitated the
use of increasingly aggressive techniques designed to reduce or hide the latency of main 
memory access. Although large cache hierarchies have proven to be effective in reducing 
this latency for the most frequently used data, it is still not uncommon for many programs 
to spend more than half their run times stalled on memory requests. Data prefetching has 
been proposed as a technique for hiding the access lat ... 

Keywords: memory latency, prefetching 



17 Multigrain shared memory 

Donald Yeung, John Kubiatowicz, Anant Agarwal 

May 2000 ACM Transactions on Computer Systems (TOCS), volume 18 issue 2 
Publisher: ACM Press 

Full text available: pdf(369.18 KB) Additional Information: full citation, abstract, references, index terms, review

Parallel workstations, each comprising tens of processors based on shared memory, 
promise cost-effective scalable multiprocessing. This article explores the coupling of such 
small- to medium-scale shared-memory multiprocessors through software over a local 
area network to synthesize larger shared-memory systems. We call these systems 
Distributed Shared-memory Multiprocessors (DSMPs). This article introduces the design of 
a shared-memory system that uses multiple granularities of sharing, ca ... 

Keywords: distributed memory, symmetric multiprocessors, system of systems 





18 A performance study of memory consistency models 
annual international symposium on Computer architecture ISCA '92, Volume 20 Issue 2
Publisher: ACM Press 

Full text available: pdf(1.17 MB) Additional Information: full citation, abstract, references, citings, index terms

Recent advances in technology are such that the speed of processors is increasing faster 
than memory latency is decreasing. Therefore the relative cost of a cache miss is 
becoming more important. However, the full cost of a cache miss need not be paid every 
time in a multiprocessor. The frequency with which the processor must stall on a cache 
miss can be reduced by using a relaxed model of memory consistency. In this paper, we 
present the results of instruction-level simulation s ... 
Parallel programs written in MPI have been widely used for developing high-performance
applications on various platforms. Because of a restriction of the MPI computation model, 
conventional MPI implementations on shared-memory machines map each MPI node to an 
OS process, which can suffer serious performance degradation in the presence of 
multiprogramming. This paper studies compile-time and runtime techniques for enhancing 
performance portability of MPI code running on multiprogrammed share ... 

Keywords: MPI, lock-free synchronization, multiprogrammed environments, program 
transformation, shared-memory machines, threaded execution 